
How to Access Big Data Cluster?

  1. Request Big Data user access credentials at intranet.iitjammu.ac.in.
  2. After approval from the authorities, you will be provided credentials to access your Big Data user account:
    Credentials: USERNAME, PASSWORD, IP_Address.
  3. SSH into the machine using MobaXterm (https://mobaxterm.mobatek.net/download.html):
    Click Session -> SSH -> enter the IP and username -> enter the password on the terminal.
  4. Write the mapper and reducer code that you want to run on the Hadoop cluster. An example mapper and reducer for word count is given below.
  5. Upload the mapper and reducer to your Big Data user space using MobaXterm.
    Run the MapReduce program for word count (a complete session is sketched after this list):
    1. Create a file data.txt with sample data.
    2. Upload data.txt to HDFS (the Hadoop Distributed File System):
      1. Create a directory in HDFS:
        hdfs dfs -mkdir data
      2. Put data.txt (stored locally) into that directory in HDFS:
        hdfs dfs -put /home/USERNAME/data.txt data
      3. Run mapper.py and reducer.py on data.txt to perform the word count:
        hadoop jar /usr/hdp/3.1.4.0-315/hadoop-mapreduce/hadoop-streaming.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input data/data.txt -output output
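
Putting the steps together, a typical session looks like the sketch below. USERNAME, IP_Address, and the sample text are placeholders; substitute the values from your own credentials. Plain OpenSSH works just as well as MobaXterm here.

    ssh USERNAME@IP_Address
    echo "hello world hello hadoop" > data.txt
    hdfs dfs -mkdir data                        # create the HDFS directory
    hdfs dfs -put /home/USERNAME/data.txt data
    hdfs dfs -ls data                           # verify that data.txt is in HDFS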

Brief Explanation:

hadoop jar <location of hadoop-streaming.jar>
-file <location of the mapper in the local file system>
-mapper <command that runs the mapper>
-file <location of the reducer in the local file system>
-reducer <command that runs the reducer>
-input <location of the input data in HDFS>
-output <HDFS directory where the output is written>
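
For example, if data.txt contains "hello world hello hadoop", the job writes tab-separated word counts, sorted by word, into the output directory:

    hadoop   1
    hello    2
    world    1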

Example mapper.py and reducer.py:

mapper.py

#!/usr/bin/python
import sys

# Word count example: mapper
# input comes from standard input (STDIN)
for line in sys.stdin:
    line = line.strip()              # remove leading and trailing whitespace
    words = line.split()             # split the line into a list of words
    for word in words:
        # write the result to standard output (STDOUT);
        # Hadoop streaming treats the text before the tab as the key
        print('%s\t%s' % (word, 1))  # emit the word with a count of 1
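
A quick way to sanity-check the mapper on its own, without Hadoop (assuming python is available on the node):

    echo "hello world hello" | python mapper.py

This should print one "word<TAB>1" line per input word.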

reducer.py

#!/usr/bin/python
import sys

# Word count example: reducer
current_word = None
current_count = 0
word = None

# input comes from STDIN; Hadoop sorts the mapper output by key,
# so all counts for the same word arrive on consecutive lines
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # silently skip lines where the count is not a number
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# emit the last word (skipped if there was no input)
if current_word:
    print('%s\t%s' % (current_word, current_count))
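
Before submitting the job, the whole pipeline can be simulated locally; the sort command stands in for Hadoop's shuffle-and-sort phase:

    cat data.txt | python mapper.py | sort | python reducer.py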
To run the job again, give -output a directory that does not already exist (Hadoop refuses to overwrite an existing output directory):

hadoop jar /usr/hdp/3.1.4.0-315/hadoop-mapreduce/hadoop-streaming.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input data/data.txt -output output/output746
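
When the job finishes, the results can be read back from HDFS. The part file is usually named part-00000, but list the directory if in doubt; the output directory can be removed before rerunning with the same -output path:

    hdfs dfs -ls output/output746
    hdfs dfs -cat output/output746/part-00000
    hdfs dfs -rm -r output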